A hemispheric two-channel code accounts for binaural unmasking in humans

Sound in noise is better detected or understood if target and masking sources originate from different locations. Mammalian physiology suggests that the neurocomputational process that underlies this binaural unmasking is based on two hemispheric channels that encode interaural differences in their relative neuronal activity. Here, we introduce a mathematical formulation of the two-channel model – the complex-valued correlation coefficient. We show that this formulation quantifies the amount of temporal fluctuations in interaural differences, which we suggest underlie binaural unmasking. We applied this model to an extensive library of psychoacoustic experiments, accounting for 98% of the variance across eight studies. Combining physiological plausibility with its success in explaining behavioral data, the proposed mechanism is a significant step towards a unified understanding of binaural unmasking and the encoding of interaural differences in general.
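As a rough illustration of the quantity named in the abstract, the following Python sketch computes a complex-valued correlation coefficient between a left-ear and a right-ear signal. The definition used here (the normalized mean of one analytic signal multiplied by the conjugate of the other) is an assumption made for illustration; the manuscript's exact formulation, filtering, and parameters are not reproduced in this file. The names `analytic_signal` and `complex_correlation`, the sampling rate, and the test stimuli are hypothetical.

```python
import numpy as np

def analytic_signal(x):
    """Analytic signal via the FFT: negative-frequency bins are set to zero."""
    n = len(x)
    spectrum = np.fft.fft(x)
    weights = np.zeros(n)
    weights[0] = 1.0                      # keep the DC bin
    if n % 2 == 0:
        weights[n // 2] = 1.0             # keep the Nyquist bin
        weights[1:n // 2] = 2.0           # double positive frequencies
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * weights)

def complex_correlation(left, right):
    """Assumed definition: <l r*> / sqrt(<|l|^2> <|r|^2>).

    With this sign convention the argument is the left-minus-right phase.
    """
    l = analytic_signal(left)
    r = analytic_signal(right)
    numerator = np.mean(l * np.conj(r))
    denominator = np.sqrt(np.mean(np.abs(l) ** 2) * np.mean(np.abs(r) ** 2))
    return numerator / denominator

fs = 8000                                  # hypothetical sampling rate in Hz
t = np.arange(800) / fs                    # 0.1 s, i.e. 50 full cycles at 500 Hz
ipd = np.pi / 4                            # interaural phase difference in rad

# A 500 Hz tone with a fixed IPD: magnitude 1, argument equal to the IPD.
gamma_tone = complex_correlation(np.cos(2 * np.pi * 500 * t + ipd),
                                 np.cos(2 * np.pi * 500 * t))

# Independent noise at the two ears: magnitude close to 0.
rng = np.random.default_rng(0)
gamma_noise = complex_correlation(rng.standard_normal(8000),
                                  rng.standard_normal(8000))
```

Under these assumptions, the argument of the coefficient tracks the IPD and its magnitude falls as the two ear signals decorrelate.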

Reviewer #1 (Remarks to the Author): The motivation for this manuscript is that while the Jeffress model has had good success at characterizing a large battery of prior psychophysical data on tone-in-noise masking, two-channel models have not performed well. This paper develops a new model based on a mathematical model of two-channel IPD coding. The model is quite simple and elegant. It essentially captures the correlation of signals at the two ears and additionally a running computation of the IPD. The model is applied to eight different datasets and does a remarkable job of describing the results, with 98% of the variance explained. Even using a single set of parameters for the model instead of optimizing parameters for each dataset, the model explains 93% of the total variance. The paper is well written and concise. The results support the conclusions. And the manuscript makes an important contribution and advancement to the field. I have only a few suggestions.
First, in the introduction it is stated that the mammalian brainstem is vastly different than that of the Jeffress model or the barn owl. However, really it is only that delay lines are missing in mammals, and thus no topographical map of ITD as prescribed by Jeffress. But it is precisely this mapping that prior models of binaural unmasking have leveraged to great success. As stated by the authors, while the Jeffress-based models do well, this is not the architecture of the mammalian brainstem! The authors might make these points in the introduction.
One of the most difficult binaural unmasking datasets to model, even with Jeffress, is 'frozen' noise, such as the experiments of Gilkey, Colburn and Carney, to name just a few. Jeffress-type models can describe average performance over all tokens, but fail to predict performance for individual noise tokens. Sample-to-sample variation in the noise tokens from trial to trial was mentioned on line 288 of the manuscript as a difficult situation for the present model to handle. Modeling frozen noise binaural unmasking is beyond the scope of the present manuscript, but it might be mentioned as a historically difficult data set to model. The present model might also have difficulty describing these data.
Finally, beginning on line 216 there is a discussion about the necessity of a form of normalization for the correlation and how this might occur physiologically. In addition to the methods suggested, it should be noted that there are also brainstem neurons that act as anti-coincidence detectors, in the lateral superior olive. These neurons are just as sensitive to ITD/IPD as the MSO neurons (see Tollin and Yin, 2005, Journal of Neuroscience). As a stimulus is decorrelated, MSO neurons reduce their responses but LSO neurons increase response. A comparison of the MSO and LSO responses is a proxy for interaural correlation. There are neurons at the next stage, the inferior colliculus, that receive inputs from both MSO and LSO neurons (Shackleton et al 2000, Hearing Research).
Reviewer #2 (Remarks to the Author): The authors propose a new mathematical formulation for the two-channel model of binaural sound localization in mammals: a complex-valued correlation coefficient gamma that contains the two channels as its real and imaginary parts. The new formulation makes it possible to explain binaural unmasking and other psychoacoustic phenomena beyond spatial sound localization.
The manuscript is well written and the Introduction is succinct and clear, with a nice review of the field. The results are clearly presented and the figures are of very high quality. The results show impressive fitting of many different works in the literature. The Discussion also includes some limitations of the proposed model, although some of them could be improved (see below). The manuscript seems to have been reviewed already, as it is very good and, for instance, has a very low number of typos.
In summary, I suggest publication after the following minor suggestions are addressed.
Minor suggestions:
- Please add a summary of the two-channel model in the Introduction (the Jeffress model is briefly described, yet the two-channel model isn't).
- Line 298: Equation (4) uses $S_{uu}$ and $S_{zz}$, but the next line changes to $S_{ll}$ and $S_{rr}$. Please correct.
- Line 302: "and of" --> "and"
- Lines 245-265: The description of the first limitation (asymmetric IPD-rate functions) is very messy and does not clearly state the limitation, as opposed to the subsequent limitations, which are very clearly stated (line 267 "In the N_0 S_0 condition shown in Fig. 2h, the model deviates significantly from the data [...] This effect cannot be accounted for by the current model implementation", line 271 "The current model also neglects peripheral processing apart from bandpass filtering. Without this pre-processing, the model cannot account for effects that are", line 274 "Another limitation of the current implementation is its focus on a single frequency channel centered at 500 Hz"). Please clearly state the limitation.
Reviewer #3 (Remarks to the Author): Encke and Dietz describe a simple and elegant model of binaural unmasking that is quite mathematical in nature, but whose simple approach is inspired by neurophysiological data regarding sound localisation in small mammals. This data suggests that Jeffress' (1948) model for the detection of interaural time delays is quite overspecified. In Jeffress' model there is an array of cells in the brainstem that is innervated by both ears via axons that vary systematically in length. The variation in length creates differential conduction times from the two ears and so action potentials arriving simultaneously at a given neuron will encode an external sound with a particular interaural delay. The putative cells fire when such coincidences occur. The neurophysiology suggests, however, that there is not a finely tuned array of these cells, but two populations that are broadly tuned to just two different interaural delays; intermediate delays can thus be coded by the relative activation of these two populations. The present paper extends this idea to the phenomenon of binaural unmasking. It further simplifies it through a mathematical abstraction; because the tuning of the two populations of cells is at approximately +/-45 degrees in each frequency band, the two channels appear to be in quadrature phase and so can efficiently encode all possible phase differences. The model is therefore reduced to mathematically deriving a phase and a coherence value in each band (the coherence can be thought of as the maximum of the cross-correlation function, but is defined slightly differently in footnote 1). Although it is clear where the equivalent phase would be derived physiologically, it was less clear to me where the coherence would come from in the brain using only the activations of the two cell populations. A clearer statement is needed there.
The model includes an explicit integration of monaural and binaural cues using signal detection theory, which is nice to see.
A wide range of data is successfully fitted using the model, as shown in Figure 2. I would support the use of individually fitted model parameters for each dataset, because there are substantial quantitative differences between studies for the same conditions, which prevents a common set of parameters from producing a good fit across data sets. The choice of studies seems very appropriate, but they are all using 500 Hz as the signal frequency and in my experience getting correct predictions across frequency is one of the hardest things to do in this area. Using broadband noise, the binaural unmasking effect (the "BMLD") decreases with increasing frequency, but asymptotes to 2-3 dB above 1500 Hz. As I understand it, this mathematical model would behave almost identically at different frequencies. The authors remark that "the model cannot account for effects that are associated with the periphery" (line 272) which appears to be a slightly oblique reference to this problem, because most models in this area tackle the effect of frequency using a model of peripheral transduction. Such models reduce, with increasing frequency, the encoding of the temporal fine structure of the waveform, but conserve the encoding of the waveform envelope. I thought this discussion could be more explicit.
Another problem with the purely mathematical treatment is that the model uses phase information directly. As a consequence, it makes identical predictions for the conditions known as NoSpi and NpiSo, for which the interaural phases of the noise and the signal both differ by pi radians. These conditions consistently differ empirically, with NoSpi giving a larger effect. The authors address this issue in the Discussion (from line 245) by invoking the asymmetric shape of the observed IPD rate functions (illustrated in Fig. 4a). Again, a more detailed physiological implementation is invoked as needed to capture all features of the data, and, again, I wonder about the effect of frequency. The difference between NoSpi and NpiSo is quite small at 500 Hz, but it grows substantially at lower frequencies. I would like to know how the rate function shape would predict this effect. Would it require greater asymmetry at lower frequencies, or would the same thing have a larger effect on predictions?
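The symmetry described in this comment can be checked numerically with a short Python sketch. It assumes that the complex correlation coefficient is the normalized mean of one analytic signal times the conjugate of the other, and it uses unfiltered Gaussian noise with an arbitrary tone level; none of these choices is taken from the manuscript, and all names are hypothetical. Because the right-ear signal in NpiSo is the sign-inverted right-ear signal of NoSpi, the coefficient only changes sign, so a decision stage that depends solely on the distance between the noise-alone and the signal-plus-noise coefficient (again an assumption) would not separate the two conditions.

```python
import numpy as np

def analytic_signal(x):
    """Analytic signal via the FFT: negative-frequency bins are set to zero."""
    n = len(x)
    weights = np.zeros(n)
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0
        weights[1:n // 2] = 2.0
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(np.fft.fft(x) * weights)

def complex_correlation(left, right):
    """Assumed definition: <l r*> / sqrt(<|l|^2> <|r|^2>)."""
    l = analytic_signal(left)
    r = analytic_signal(right)
    return np.mean(l * np.conj(r)) / np.sqrt(
        np.mean(np.abs(l) ** 2) * np.mean(np.abs(r) ** 2))

rng = np.random.default_rng(0)
fs = 8000                                   # hypothetical sampling rate in Hz
t = np.arange(fs) / fs                      # 1 s of signal
noise = rng.standard_normal(t.size)         # unfiltered noise (simplification)
tone = 0.5 * np.cos(2 * np.pi * 500 * t)    # arbitrary tone level

# NoSpi: noise identical at both ears, tone inverted at the right ear.
gamma_n0spi = complex_correlation(noise + tone, noise - tone)
# NpiSo: noise inverted at the right ear, tone identical at both ears.
gamma_npis0 = complex_correlation(noise + tone, -noise + tone)

# Noise-alone references for the two masker configurations.
gamma_n0 = complex_correlation(noise, noise)
gamma_npi = complex_correlation(noise, -noise)

# Distance between noise-alone and signal-plus-noise coefficients.
d_n0spi = abs(gamma_n0spi - gamma_n0)
d_npis0 = abs(gamma_npis0 - gamma_npi)
```

In this sketch `gamma_npis0` equals `-gamma_n0spi` and the two distances coincide, which is consistent with the reviewer's point that an asymmetry has to enter elsewhere, for example through the shape of the IPD rate functions.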
Writing: On lines 259 and 261 the word "then" appears to be used in place of "than".

Authors' Response to Reviews of A hemispheric two-channel code accounts for binaural unmasking in humans

mance on a battery of prior human psychophysical data. For over 7 decades, the prevailing theory of binaural processing based on the interaural time difference (ITD, or equivalently, the interaural phase difference for a known frequency) cue to sound source location was the so-called Jeffress model. This model postulated three things: 1) phase locking of peripheral neuron spiking to the stimulus at each ear, 2) physical delay lines (longer axon path lengths) that compensate for the external acoustic ITD, and 3) coincidence detecting neurons that receive input from both ears and that fire maximally when action potentials arrive coincidently. Finally, Jeffress postulated that there is a population of coincidence detectors that are differentially sensitive to a range of ITDs. Thus, a time-based cue, ITD, is converted to a place code, in essence a topographical mapping of ITD. The Jeffress neural architecture has been demonstrated in the anatomy and physiology of the barn owl brainstem. In the mammalian brainstem, however, phase locking and coincidence detecting neurons have been documented, but delay lines in the form of physically longer axons from the two ears have not been found. And thus, nothing approaching a topographical map of ITD has been found in the brainstem. More recently it has been proposed that the mammalian binaural system for ITD/IPD coding must be different than the classical Jeffress model. A two-channel hemispheric model of ITD/IPD coding has gathered support. The motivation for this manuscript is that while the Jeffress model has had good success at characterizing a large battery of prior psychophysical data on tone-in-noise masking, two-channel models have not performed well. This paper develops a new model based on a mathematical model of two-channel IPD coding.
The model is quite simple and elegant. It essentially captures the correlation of signals at the two ears and additionally a running computation of the IPD. The model is applied to eight different datasets and does a remarkable job of describing the results, with 98% of the variance explained. Even using a single set of parameters for the model instead of optimizing parameters for each dataset, the model explains 93% of the total variance. The paper is well written and concise. The results support the conclusions. And the manuscript makes an important contribution and advancement to the field.
AR: We thank Reviewer 1 for this positive evaluation of our manuscript. We have uploaded a version of the manuscript where all changes to the previous version are indicated. We implement the changes as suggested (see below). Line numbers in this document refer to those in the pdf with tracked changes.

Introduction
RC: First, in the introduction it is stated that the mammalian brainstem is vastly different than that of the Jeffress model or the barn owl. However, really it is only that delay lines are missing in mammals, and thus no topographical map of ITD as prescribed by Jeffress. But it is precisely this mapping that prior models of binaural unmasking have leveraged to great success. As stated by the authors, while the Jeffress-based models do well, this is not the architecture of the mammalian brainstem! The authors might make these points in the introduction.

RC: One of the most difficult binaural unmasking datasets to model, even with Jeffress, is 'frozen' noise, such as the experiments of Gilkey, Colburn and Carney, to name just a few. Jeffress-type models can describe average performance over all tokens, but fail to predict performance for individual noise tokens. Sample-to-sample variation in the noise tokens from trial to trial was mentioned on line 288 of the manuscript as a difficult situation for the present model to handle. Modeling frozen noise binaural unmasking is beyond the scope of the present manuscript, but it might be mentioned as a historically difficult data set to model. The present model might also have difficulty describing these data.
AR: Thank you for this very positive review! We have addressed your suggestions as discussed below. Line numbers in this document refer to those in the pdf with tracked changes.

Reviewer #3
General Comment
RC: Encke and Dietz describe a simple and elegant model of binaural unmasking that is quite mathematical in nature, but whose simple approach is inspired by neurophysiological data regarding sound localisation in small mammals. This data suggests that Jeffress' (1948) model for the detection of interaural time delays is quite overspecified. In Jeffress' model there is an array of cells in the brainstem that is innervated by both ears via axons that vary systematically in length. The variation in length creates differential conduction times from the two ears and so action potentials arriving simultaneously at a given neuron will encode an external sound with a particular interaural delay. The putative cells fire when such coincidences occur. The neurophysiology suggests, however, that there is not a finely tuned array of these cells, but two populations that are broadly tuned to just two different interaural delays; intermediate delays can thus be coded by the relative activation of these two populations. The present paper extends this idea to the phenomenon of binaural unmasking. It further simplifies it through a mathematical abstraction; because the tuning of the two populations of cells is at approximately +/-45 degrees in each frequency band, the two channels appear to be in quadrature phase and so can efficiently encode all possible phase differences. The model is therefore reduced to mathematically deriving a phase and a coherence value in each band (the coherence can be thought of as the maximum of the cross-correlation function, but is defined slightly differently in footnote 1).
AR: We thank the reviewer for the positive evaluation of the manuscript and the insightful comments. We have addressed all comments (please see responses below). Line numbers in this document refer to those in the pdf with tracked changes.

Comment 1
RC: Although it is clear where the equivalent phase would be derived physiologically, it was less clear to me where the coherence would come from in the brain using only the activations of the two cell populations. A clearer statement is needed there.

Comment 2
RC: A wide range of data is successfully fitted using the model, as shown in Figure 2. I would support the use of individually fitted model parameters for each dataset, because there are substantial quantitative differences between studies for the same conditions, which prevents a common set of parameters from producing a good fit across data sets. The choice of studies seems very appropriate, but they are all using 500 Hz as the signal frequency and in my experience getting correct predictions across frequency is one of the hardest things to do in this area. Using broadband noise, the binaural unmasking effect (the "BMLD") decreases with increasing frequency, but asymptotes to 2-3 dB above 1500 Hz. As I understand it, this mathematical model would behave almost identically at different frequencies. The authors remark that "the model cannot account for effects that are associated with the periphery" (line 272) which appears to be a slightly oblique reference to this problem, because most models in this area tackle the effect of frequency using a model of peripheral transduction. Such models reduce, with increasing frequency, the encoding of the temporal fine structure of the waveform, but conserve the encoding of the waveform envelope. I thought this discussion could be more explicit.
Another limitation of the current implementation is its focus on a single frequency channel centered at 500 Hz. With different parametrization, the model can be reasonably expected to account for experiments at different frequencies. Other experiments, however, such as those employing spectrally complex maskers or maskers constructed from two noise sources with different ITDs, might require a multi-channel implementation of the model. The success of this kind of model extension has recently been demonstrated [22]. [...]

Comment 3
RC: Another problem with the purely mathematical treatment is that the model uses phase information directly. As a consequence, it makes identical predictions for the conditions known as NoSpi and NpiSo, for which the interaural phases of the noise and the signal both differ by pi radians. These conditions consistently differ empirically, with NoSpi giving a larger effect. The authors address this issue in the Discussion (from line 245) by invoking the asymmetric shape of the observed IPD rate functions (illustrated in Fig. 4a). Again, a more detailed physiological implementation is invoked as needed to capture all features of the data, and, again, I wonder about the effect of frequency. The difference between NoSpi and NpiSo is quite small at 500 Hz, but it grows substantially at lower frequencies. I would like to know how the rate function shape would predict this effect. Would it require greater asymmetry at lower frequencies, or would the same thing have a larger effect on predictions?

lines 259 and 261
RC: the word "then" appears to be used in place of "than"
AR: Fixed